17 research outputs found
The shocklet transform: a decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series
We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series, termed the Discrete Shocklet Transform (DST), and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST’s ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms’ utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially important events as perceived by English-language Twitter.
Hurricanes and hashtags: Characterizing online collective attention for natural disasters
We study collective attention paid towards hurricanes through the lens of
n-grams on Twitter, a social media platform with global reach. Using
hurricane name mentions as a proxy for awareness, we find that the exogenous
temporal dynamics are remarkably similar across storms, but that overall
collective attention varies widely even among storms causing comparable deaths
and damage. We construct 'hurricane attention maps' and observe that hurricanes
causing deaths on (or economic damage to) the continental United States
generate substantially more attention in English-language tweets than those
that do not. We find that a hurricane's Saffir-Simpson wind scale category
assignment is strongly associated with the amount of attention it receives.
Higher-category storms receive larger proportional increases in attention per
proportional increase in number of deaths or dollars of damage than
lower-category storms. The most damaging and deadly storms of the 2010s, Hurricanes
Harvey and Maria, generated the most attention and were remembered the longest,
respectively. On average, a category 5 storm receives 4.6 times more attention
than a category 1 storm causing the same number of deaths and economic damage.
Comment: 31 pages (14 main, 17 Supplemental), 19 figures (5 main, 14 appendix)
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
In real time, Twitter strongly imprints world events, popular culture, and the day-to-day; Twitter records an ever-growing compendium of language use and change; and Twitter has been shown to enable certain kinds of prediction. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing, day-scale curation of over 100 billion tweets containing around 1 trillion 1-grams from 2008 to 2020. For each day, we break tweets into 1-, 2-, and 3-grams across 150+ languages, record usage frequencies, and generate Zipf distributions. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through ‘contagiograms’.
Giant modulation of the electronic band gap of carbon nanotubes by dielectric screening
Ion pairs and solubility related to ion-pairing in water influence many processes in nature and in synthesis, including efficient drug delivery, contaminant transport in the environment, and self-assembly of materials in water. Ion pairs are difficult to observe spectroscopically because they generally do not persist unless extreme solution conditions are applied. Here we demonstrate two advanced techniques coupled with computational studies that quantify the persistence of ion pairs in simple solutions and offer explanations for observed solubility trends. The system of study, ([(CH₃)₄N]⁺,Cs)₈[M₆O₁₉] (M = Nb,Ta), is a set of unique polyoxometalate salts whose water solubility increases with increasing ion-pairing, contrary to most ionic salts. The techniques employed to characterize Cs⁺ association with [M₆O₁₉]⁸⁻ and related clusters in simple aqueous media are ¹³³Cs NMR (nuclear magnetic resonance) quadrupolar relaxation rate and PDF (pair distribution function) from X-ray scattering. The NMR measurements consistently showed more extensive ion-pairing of Cs⁺ with the Ta-analogue than the Nb-analogue, although the electrostatics of the ions should be identical. Computational studies also ascertained more persistent Cs⁺–[Ta₆O₁₉] ion pairs than Cs⁺–[Nb₆O₁₉] ion pairs, and bond energy decomposition analyses determined relativistic effects to be the differentiating factor between the two. These distinctions are likely responsible for many of the unexplained differences between aqueous Nb and Ta chemistry, while they are so similar in the solid state. The X-ray scattering studies show atomic-level detail of this ion association that has not previously been observed, enabling confidence in our structures for calculations of Cs-cluster association energies. Moreover, detailed NMR studies allow quantification of the number of Cs⁺ associated with a single [Nb₆O₁₉]⁸⁻ or [Ta₆O₁₉]⁸⁻ anion, which agrees with the PDF analyses.
Interpretable bias mitigation for textual data: Reducing gender bias in patient notes while maintaining classification performance
Medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models (statistical estimates of the relationships between concepts derived from distant reading of corpora). Building on this work, we investigate how word choices made by healthcare practitioners and language models interact with regard to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low to medium levels of bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce the potential for bias in natural language processing pipelines.